for 1-bit CNNs as
$$
\begin{aligned}
L_B = \frac{\lambda}{2}\sum_{l=1}^{L}\sum_{i=1}^{C_o^l}\sum_{n=1}^{C_i^l}\Big\{ & \,\|\hat{k}_n^{l,i}-w^l\circ k_n^{l,i}\|_2^2 \\
& + \nu\,(k_{n+}^{l,i}-\mu_{i+}^l)^T(\Psi_{i+}^l)^{-1}(k_{n+}^{l,i}-\mu_{i+}^l) \\
& + \nu\,(k_{n-}^{l,i}+\mu_{i-}^l)^T(\Psi_{i-}^l)^{-1}(k_{n-}^{l,i}+\mu_{i-}^l)
+ \nu\log\big(\det(\Psi^l)\big)\Big\} \\
+ \frac{\theta}{2}\sum_{m=1}^{M}\Big\{ & \,\|f_m-c_m\|_2^2
+ \sum_{n=1}^{N_f}\big[\sigma_{m,n}^{-2}(f_{m,n}-c_{m,n})^2+\log(\sigma_{m,n}^2)\big]\Big\},
\end{aligned}
\tag{3.108}
$$
where $k_n^{l,i}$, $l \in \{1,\ldots,L\}$, $i \in \{1,\ldots,C_o^l\}$, $n \in \{1,\ldots,C_i^l\}$, is the vectorization of the $i$-th kernel matrix at the $l$-th convolutional layer, $w^l$ is a vector used to modulate $k_n^{l,i}$, and $\mu_i^l$ and $\Psi_i^l$ are the mean and covariance of the $i$-th kernel vector at the $l$-th layer, respectively. We term $L_B$ the Bayesian optimization loss. Furthermore, we assume that the parameters within the same kernel are independent, so $\Psi_i^l$ becomes a diagonal matrix whose entries all equal $(\sigma_i^l)^2$, the variance of the $i$-th kernel of the $l$-th layer. In this case, the inverse of $\Psi_i^l$ can be computed quickly, and all elements of $\mu_i^l$ are identical and equal to the scalar $\mu_i^l$. Note that in our implementation, all elements of $w^l$ are replaced by their average during the forward process. Accordingly, only a scalar instead of a matrix is involved in the inference, which significantly accelerates the computation.
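To make the diagonal-covariance simplification concrete, the following is a minimal PyTorch-style sketch of the kernel part of $L_B$ for a single layer, assuming the symmetric case $\mu_{i+}^l = -\mu_{i-}^l = \mu_i^l$. The function name `bayesian_kernel_loss`, the tensor shapes, and the treatment of $w^l$ as a single averaged scalar are illustrative assumptions rather than the reference implementation.

```python
import torch

def bayesian_kernel_loss(k, k_hat, w, mu, sigma, nu=1e-4):
    """Kernel term of L_B for one layer, under the simplification above:
    Psi_i^l = (sigma_i^l)^2 * I and w^l replaced by its scalar average.

    k     : (C_o, C_i, K) full-precision kernels, vectorized per (i, n)
    k_hat : (C_o, C_i, K) binarized / reconstructed kernels
    w     : scalar modulation factor (average of w^l)
    mu    : (C_o,) per-channel mean of the positive Gaussian mode
    sigma : (C_o,) per-channel standard deviation
    """
    # Quantization error ||k_hat - w o k||_2^2
    recon = ((k_hat - w * k) ** 2).sum()

    mu_b = mu.view(-1, 1, 1)            # broadcast over (C_i, K)
    var_b = (sigma ** 2).view(-1, 1, 1)

    # Symmetric two-mode mixture: positive entries are pulled toward +mu,
    # negative entries toward -mu, i.e. (k_+ - mu) and (k_- + mu).
    pos = (k >= 0).float()
    dev = pos * (k - mu_b) + (1.0 - pos) * (k + mu_b)

    # With a diagonal covariance, the Mahalanobis distance reduces to an
    # element-wise scaled square, and the log-determinant of a K x K
    # diagonal matrix with identical variance is K * log(sigma^2).
    mahalanobis = (dev ** 2 / var_b).sum()
    logdet = k.shape[-1] * torch.log(sigma ** 2).sum()

    return recon + nu * (mahalanobis + logdet)
```

The Bayesian feature loss in Eq. 3.108 can be implemented in the same spirit, penalizing the variance-weighted distance between each feature $f_m$ and its class center $c_m$.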
After training the 1-bit CNNs, the Bayesian pruning loss $L_P$ is then used to optimize the feature channels. It can be written as:
$$
L_P = \sum_{l=1}^{L}\sum_{j=1}^{J_l}\sum_{i=1}^{I_j}\Big\{\|K_{i,j}^l-\bar{K}_j^l\|_2^2
+ \nu\,(K_{i,j}^l-\bar{K}_j^l)^T(\Psi_j^l)^{-1}(K_{i,j}^l-\bar{K}_j^l)
+ \nu\log\big(\det(\Psi_j^l)\big)\Big\},
\tag{3.109}
$$
where $J_l$ is the number of Gaussian clusters (groups) at the $l$-th layer, and $K_{i,j}^l$, $i = 1, 2, \ldots, I_j$, are those $K_i^l$'s that belong to the $j$-th group, with $\bar{K}_j^l$ their group mean. In our implementation, we define $J_l = \mathrm{int}(C_o^l \times \epsilon)$, where $\epsilon$ is a predefined pruning rate; in this chapter, we use the same $\epsilon$ for all layers. Note that when the $j$-th Gaussian has only one sample $K_{i,j}^l$, $\bar{K}_j^l = K_{i,j}^l$ and $\Psi_j^l$ is an identity matrix.
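As a companion to Eq. 3.109, here is a hedged PyTorch-style sketch of the per-layer pruning loss. The group assignments are assumed to come from some clustering of the kernels into $J_l = \mathrm{int}(C_o^l \times \epsilon)$ groups (e.g., k-means), and the diagonal group covariance is estimated here directly from the group members; both the name `bayesian_pruning_loss` and these estimation choices are assumptions for illustration.

```python
import torch

def bayesian_pruning_loss(kernels, groups, nu=1e-4, eps=1e-8):
    """Pruning loss L_P for one layer, assuming each group's covariance
    Psi_j^l is diagonal and estimated from the group members.

    kernels : (C_o, D) flattened kernels K_i^l of the layer
    groups  : (C_o,) integer cluster assignment in [0, J_l) per kernel
    """
    loss = kernels.new_zeros(())
    for j in groups.unique():
        members = kernels[groups == j]              # (I_j, D) kernels in group j
        center = members.mean(dim=0, keepdim=True)  # group mean \bar{K}_j^l

        if members.shape[0] == 1:
            var = torch.ones_like(center)           # single sample: identity covariance
        else:
            var = members.var(dim=0, unbiased=False, keepdim=True) + eps

        diff = members - center
        loss = loss + (diff ** 2).sum()                              # ||K_ij - K_j||^2
        loss = loss + nu * (diff ** 2 / var).sum()                   # Mahalanobis term
        loss = loss + nu * members.shape[0] * torch.log(var).sum()   # sum_i log det(Psi_j)

    return loss
```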
In BONNs, the cross-entropy loss $L_S$, the Bayesian optimization loss $L_B$, and the Bayesian pruning loss $L_P$ are aggregated to build the total loss:
$$
L = L_S + L_B + \zeta L_P,
\tag{3.110}
$$
where $\zeta$ is 0 during binarization training and becomes 1 during pruning. The Bayesian kernel loss constrains the distribution of the convolution kernels to a symmetric Gaussian mixture with two modes, while simultaneously minimizing the quantization error through the $\|\hat{k}_n^{l,i}-w^l\circ k_n^{l,i}\|_2^2$ term. Meanwhile, the Bayesian feature loss modifies the distribution of the features to reduce intraclass variation for better classification. The Bayesian pruning loss pulls kernels toward their group means and thus compresses the 1-bit CNNs further.
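Since $\zeta$ only switches between the two training stages, the aggregation in Eq. 3.110 amounts to a simple gate; a small sketch, with hypothetical argument names, is shown below.

```python
def total_loss(loss_s, loss_b, loss_p, pruning_stage):
    """L = L_S + L_B + zeta * L_P (Eq. 3.110):
    zeta is 0 during binarization training and 1 during pruning."""
    zeta = 1.0 if pruning_stage else 0.0
    return loss_s + loss_b + zeta * loss_p
```

In practice, the model is first trained with `pruning_stage=False` to obtain the 1-bit CNN, and then fine-tuned with `pruning_stage=True` so that $L_P$ drives kernels in the same group toward their mean.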
3.7.5 Forward Propagation
In forward propagation, the binarized kernels and activations accelerate the convolution
computation. The reconstruction vector is essential for 1-bit CNNs as described in Eq. 3.97,